Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pick up Propolis 50cb28f5 #5906

Merged
merged 3 commits into from
Jun 18, 2024
Merged

pick up Propolis 50cb28f5 #5906

merged 3 commits into from
Jun 18, 2024

Conversation

gjcolombo
Copy link
Contributor

Update to Propolis commit 50cb28f5 to obtain the following updates:

50cb28f server: separate statuses of inbound and outbound migrations (#661)
61e1481 Don't return None from BlockData::block_for_req when attachment is paused (#711)
79e243b Build and clippy fixes for Rust 1.79 (#712)
3c0f998 Modernize fw_cfg emulation
7a65be9 Update and prune dependencies

50cb28f5 changes the Propolis instance status and migration status APIs to report both inbound and outbound migration status. Update sled agent accordingly.

Tests: cargo nextest; will also run a couple of instances on a dev cluster before merging.

@hawkw hawkw self-requested a review June 17, 2024 20:27
Comment on lines 121 to 138
// Decide which of Propolis's reported migrations to pay attention to.
// If this Propolis is currently a migration target (evidenced by its ID
// appearing in the `dst_propolis_id` field in the instance record),
// look at the current migration in. Otherwise, look at the reported
// migration out.
let role = if instance_runtime
.dst_propolis_id
.is_some_and(|id| id == propolis_id)
{
MigrationRole::Target
} else {
MigrationRole::Source
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is going to have to change substantially once sled-agent no longer has an InstanceRuntimeState to make this determination. I think we ought to be able to do this instead using the migration_state in InstanceStates, which should also have the ID of the currently active migration (which would let us check whether the migration in or migration out refers to the current migration).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, thanks! d9b4ae8 should clean this up a bit.

@hawkw
Copy link
Member

hawkw commented Jun 17, 2024

these clippy failures should be fixed by #5907, btw, so rebasing should fix that.

@gjcolombo gjcolombo marked this pull request as ready for review June 17, 2024 23:36
@gjcolombo
Copy link
Contributor Author

This looks good in local testing (I've gone through a few instance state transitions and checked to make sure that instance reboot works properly now that the fix for oxidecomputer/propolis#709 is in place). Still need to pick up the fix for #5907 in this branch, but otherwise this is ready for review.

@gjcolombo gjcolombo requested a review from hawkw June 17, 2024 23:45
@hawkw
Copy link
Member

hawkw commented Jun 17, 2024

looks great! thanks for changing it to use the MigrationState instead of the InstanceRuntimeState — that'll make my life easier later!

@gjcolombo
Copy link
Contributor Author

I appear to have upset the migration integration tests; will investigate.

@gjcolombo
Copy link
Contributor Author

The problem is an assertion failure caused by d9b4ae8 (newest commit in this PR). The simplest fix may just be to revert that commit, but I'll think about it some more.

The basic issue is that InstanceStates::apply_propolis_observation assumes that the "migration in succeeded" branch of its logic will only be reached once per migration target. (This assumption is encoded in InstanceStates::switch_propolis_id_to_target, which asserts that the new active Propolis ID it's writing isn't None.) Previously, this was ensured by having ObservedPropolisState::new key off of the migration ID in sled agent's copy of the InstanceRuntimeState: once migration succeeds, sled agent clears the instance's migration ID, so subsequent observations will report that no migration is in progress even if Propolis continues to report on the completed migration.

d9b4ae8 breaks this assumption by making ObservedPropolisState use sled agent's MigrationRuntimeState to decide if a migration is in progress. This migration state is not cleared by InstanceStates::clear_migration_ids, so the existing "migration succeeded" sequence no longer closes the door on itself properly. The fix is unfortunately not as simple as clearing more fields in clear_migration_ids--the MigrationRuntimeState needs to be Some at the end of apply_propolis_observation, or sled agent won't send updated migration state to Nexus.

This is, of course, way more complicated than it needs to be: sled agent shouldn't be in the business of updating instance runtime state at all, and if it weren't there would be no assertions to trip. Unfortunately, we're not there quite yet, so we'll need to stitch something together to keep this working in the meantime.

@gjcolombo gjcolombo force-pushed the gjcolombo/propolis-update branch from d9b4ae8 to 26cb441 Compare June 18, 2024 16:25
The instance runtime state machine in sled agent requires that an
inbound migration appear to be completed only once (so that there's only
one "move the destination Propolis ID to the current Propolis ID and
clear the destination ID" operation per successful migration). This
operation (clearing the instance's active migration ID) is itself what
prevents future Propolis state updates from being interpreted as a
completed migration (once the migration ID is cleared from sled agent's
runtime state, it will no longer match the old ID that Propolis is
reporting).

In the brave new world where sled agent doesn't track instance runtime
state at all, substantially all of this code will be removed: there will
be no instance runtime state to transition to anything; sled agent will
just pass migration statuses straight through to Nexus.
@gjcolombo
Copy link
Contributor Author

I haven't rebased this PR on top of #5907 yet (so as not to lose my place in the buildomat queue), but I've done so locally, run cargo xtask clippy, and see no errors.

@gjcolombo gjcolombo force-pushed the gjcolombo/propolis-update branch from 26cb441 to ecc9c33 Compare June 18, 2024 17:21
@hawkw
Copy link
Member

hawkw commented Jun 18, 2024

As we discussed on Matrix, I'm fine with the change back to using InstanceRuntimeState for now so that everything can keep working. I'll get rid of the use of InstanceRuntimeState when removing it from sled-agent in #5749.

@gjcolombo gjcolombo enabled auto-merge (squash) June 18, 2024 19:21
@gjcolombo gjcolombo merged commit 7fb6cb4 into main Jun 18, 2024
21 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/propolis-update branch June 18, 2024 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants